Data science has become a popular subject among people in business and commerce, and colleges and universities keep creating degree programs in the field. Two of the skills companies most often look for in applicants are Python and R. Python, with its many available libraries, has become a language of choice for data science. However, there are alternatives, and one of those is the R programming language. R is an open-source implementation of S, a statistical language developed several decades ago. R is free, and its source code is available to developers. The language ships with datasets and functions useful for analysis and is used in colleges and universities worldwide. A popular tool for working with R is RStudio, an integrated development environment (IDE) created by J.J. Allaire. Dr. Hadley Wickham, chief scientist at the company behind RStudio, is the creator of the tidyverse, a collection of packages for cleaning and managing data. In this article, we take a basic tour of R functions and use RStudio to demonstrate them. This tutorial is not an introduction but a demonstration of what you can do with the language.
To start, look at the RStudio interface:
This is the setup on my HP laptop. Notice there are several windows. The first window is the R console, which lets me enter functions and see the output immediately. Using the interface, I can also create scripts, import datasets, and browse my files on Windows 10. There is a commercial version with additional functionality, but I am using the free, open-source edition.
Here are some sample functions and examples we will try:
mean()
median()
mode()
max()
min()
IQR()
var()
sd()
These are basic functions that describe our data numerically. To begin, I will create a vector of sample values. When our data set is small, we can store it in a vector and analyze it directly. We create a vector and assign it to a variable like this:
x <- c(98, 76, 88, 95, 100, 107, 88, 66)
This tells R to assign our vector of values to the variable x. A useful feature of RStudio is the Environment window, which shows us what is in our vector and verifies that we have declared the variable correctly in R. Here is the output from this window:
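The same check can also be done from the console. This is a minimal sketch using base R functions only; it is meant as a console counterpart to the Environment window, not a replacement for it.

```r
# Create the sample vector used throughout this post
x <- c(98, 76, 88, 95, 100, 107, 88, 66)

# Console checks that mirror what the Environment window reports
exists("x")   # TRUE: the variable is defined in the workspace
class(x)      # "numeric"
length(x)     # 8
str(x)        # compact summary of the type and the values
```

If any of these give an unexpected answer, the assignment most likely contains a typo.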
Next, I call each function and get the results in our console:
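The calls look roughly like this. The vector is redefined so the snippet runs on its own. Note that stat_mode() is a hypothetical helper of my own, not part of base R; I include it because mode() reports how R stores an object rather than the most frequent value.

```r
x <- c(98, 76, 88, 95, 100, 107, 88, 66)

mean(x)     # 89.75
median(x)   # 91.5
mode(x)     # "numeric" (the storage mode, not the statistical mode)
max(x)      # 107
min(x)      # 66
IQR(x)      # 13.5 with R's default quantile method
var(x)      # sample variance, about 179.64
sd(x)       # sample standard deviation, about 13.40

# Hypothetical helper (an assumption, not a base R function):
# returns the most frequent value(s), or an empty vector if nothing repeats.
stat_mode <- function(v) {
  counts <- table(v)
  if (max(counts) == 1) {
    return(numeric(0))
  }
  as.numeric(names(counts)[as.vector(counts) == max(counts)])
}

stat_mode(x)   # 88, which appears twice
```

Tabulating with table() is only one way to find the most frequent value; it returns every tied value if the data are multimodal.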
In the above example, we calculated basic descriptive statistics for our vector. Notice the value for mode(). In R, mode() does not compute the statistical mode; it reports the storage mode of the object, which is why it returned “numeric.” Base R has no built-in function for the statistical mode, so it has to be computed separately, for example by tabulating the values. Here the value 88 appears twice, so the mode of our data is 88. The mode can also tell us whether we have a multimodal distribution. The minimum and maximum values were 66 and 107, respectively, so our data span a range of 41. The IQR() function calculates the difference between the third and first quartiles of our data. Of the functions demonstrated, mean() and median() are measures of central tendency, var(), sd(), and IQR() are measures of spread, and min() and max() give the extremes. Calling these functions in R shows us how our data looks and lets us make general assumptions about it. Although I did not list boxplot() among the sample functions, I will call it to generate a box plot of the vector data:
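A minimal sketch of the call is below. The title, axis label, and the extra point marking the mean are my own additions for illustration; boxplot() on its own draws only the median, the hinges, and the whiskers.

```r
x <- c(98, 76, 88, 95, 100, 107, 88, 66)

# Draw a horizontal box plot; boxplot() also returns the numbers it used
bp <- boxplot(x, horizontal = TRUE,
              main = "Sample values", xlab = "Value")

# Mark the mean with a cross, since boxplot() does not draw it
points(mean(x), 1, pch = 4)

# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats
```

The hinges in bp$stats are computed the same way as fivenum(), so they can differ slightly from the quartiles that IQR() uses.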
The thick bar inside the box marks the median, 91.5. The mean, 89.75, is slightly lower and is not drawn by boxplot(). To investigate further, we can draw a histogram using the hist() function:
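A sketch of the histogram call follows. The title and axis label are illustrative assumptions, and the bins are whatever hist() chooses by default.

```r
x <- c(98, 76, 88, 95, 100, 107, 88, 66)

# Draw the histogram with default bins and keep the returned object
h <- hist(x, main = "Histogram of sample values", xlab = "Value")

h$breaks   # bin edges chosen by hist()
h$counts   # number of values falling in each bin
```

Inspecting h$breaks and h$counts gives the same information as the picture, which helps when comparing different bin widths.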
The distribution appears skewed left: the low values 66 and 76 pull the mean (89.75) below the median (91.5). With only eight values this is a rough impression, but the bulk of the values in our sample data set fall between 80 and 100.
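The skew can also be checked numerically. This sketch uses the moment-based skewness coefficient, which is my choice for illustration; base R has no skewness function, and add-on packages may use slightly different formulas.

```r
x <- c(98, 76, 88, 95, 100, 107, 88, 66)

# A mean below the median is a common sign of left skew
mean(x) < median(x)   # TRUE

# Moment-based skewness: third central moment over variance^(3/2)
m <- mean(x)
skew <- mean((x - m)^3) / mean((x - m)^2)^1.5
skew                  # negative for left-skewed data

# Share of values between 80 and 100
mean(x >= 80 & x <= 100)   # 0.625, i.e. 5 of the 8 values
```

A negative coefficient agrees with the left skew seen in the histogram, though a sample this small cannot settle the shape of the underlying distribution.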
In this post, I gave a brief tour of the R programming language and used basic descriptive statistics, covering both central tendency and spread, to illustrate how to use it. You have seen the RStudio interface and what it looks like when you first open the IDE. Finally, we generated a box plot and a histogram to see how our data looks and to make inferences about its distribution.